This PR integrates the Nanochat language model stack into Plato, enabling federated learning experiments with GPT-style models such as Nanochat. It adds Nanochat as a Git submodule and introduces the CORE benchmark for language model evaluation.

Description

Third-party submodule and model integration:

  • Nanochat submodule (external/nanochat): Git submodule integration of karpathy/nanochat.
  • Model factory (plato/models/nanochat.py): Nanochat model with configurable architecture parameters, checkpoint loading, and automatic tokenizer attachment.
  • Registry integration (plato/models/registry.py): Registered the "Nanochat" model type for configuration-based instantiation (see the sketch after this list).
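
The registry hookup means a model can be built purely from configuration. As a rough illustration (the class and registry names below are hypothetical, not the exact code in this PR, and the GPTConfig fields should be checked against the submodule's actual schema):

# Hypothetical sketch -- `NanochatModel` and `registered_models` are
# illustrative names, not Plato's exact API.
import torch
from nanochat.gpt import GPT, GPTConfig  # provided by external/nanochat

class NanochatModel:
    @staticmethod
    def get(n_layer=12, n_head=6, n_embd=384, checkpoint_path=None):
        model = GPT(GPTConfig(n_layer=n_layer, n_head=n_head, n_embd=n_embd))
        if checkpoint_path is not None:  # optional checkpoint loading
            model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
        return model

registered_models = {"nanochat": NanochatModel}  # looked up by the configured model type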

Tokenizer and data processing:

  • Rust tokenizer processor (plato/processors/nanochat_tokenizer.py): Wrapper for rustbpe + tiktoken stack with special token support and corpus training capabilities.
  • Streaming datasource (plato/datasources/nanochat.py): Configurable datasource supporting both real parquet data and synthetic token generation, with an automatic fallback to the latter (sketched after this list).
  • Registry integration (plato/datasources/registry.py): "Nanochat" datasource registered for TOML configuration.
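
The automatic fallback in the datasource is what makes the synthetic configurations runnable without downloading any data. A minimal sketch of that pattern, with illustrative names rather than the PR's exact class:

# Hypothetical sketch of the parquet-or-synthetic fallback; names and
# parameters are illustrative only.
import torch

class NanochatDataSource:
    """Parquet-backed token datasource with a synthetic fallback."""

    def __init__(self, parquet_path=None, vocab_size=50304,
                 seq_len=1024, num_samples=256):
        self.tokens = None
        if parquet_path is not None:
            try:
                import pyarrow.parquet as pq
                self.tokens = pq.read_table(parquet_path)["tokens"].to_pylist()
            except (ImportError, OSError):
                self.tokens = None  # fall through to the synthetic path
        if self.tokens is None:
            # Synthetic fallback: uniformly random token ids of the right shape.
            self.tokens = torch.randint(vocab_size, (num_samples, seq_len))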

Training infrastructure:

  • Composable trainer (plato/trainers/nanochat.py): Specialized trainer with Nanochat-specific data loading, training steps, and optimizer strategies.
  • Multiple training strategies: Custom data loader, training step, optimizer, and testing strategies tailored for Nanochat models (a training-step sketch follows this list).
  • CORE evaluation integration: Built-in support for nanochat_core evaluation type with automatic benchmark execution.
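
The core of a causal language-model training step looks roughly like the sketch below. The function signature is illustrative, not the trainer's actual strategy interface, and nanochat's GPT may compute the loss internally when given targets:

# Illustrative training-step strategy for a GPT-style model.
import torch.nn.functional as F

def nanochat_training_step(model, optimizer, tokens):
    # tokens: (batch, seq_len) integer ids; shift by one for next-token targets
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # .reshape, not .view -- see the commits below
        targets.reshape(-1),
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()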

Evaluation framework:

  • CORE benchmark adapter (plato/evaluators/nanochat_core.py): Complete port of nanochat/core_eval.py with automatic bundle download, task loading, and metric computation.
  • Comprehensive evaluation: Support for 22 language model evaluation tasks with centered accuracy metrics (see the sketch after this list).
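
The "centered" column in the test log below follows the CORE convention of rescaling raw accuracy so that a task's random-guess baseline maps to 0 and perfect accuracy maps to 1 (the helper name here is illustrative):

def centered_accuracy(accuracy, baseline):
    # baseline: the task's random-guess accuracy, e.g. 0.25 for 4-way choice
    return (accuracy - baseline) / (1.0 - baseline)

# Matches the test log: arc_challenge at 0.1875 raw accuracy with a
# four-choice baseline of 0.25 yields the logged centered score of -0.0833.
assert round(centered_accuracy(0.1875, 0.25), 4) == -0.0833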

Configuration and examples:

  • Configuration templates (plato/configs/Nanochat/): Ready-to-use synthetic and extended evaluation configurations (an illustrative snippet follows this list).
  • Example workspace (plato/examples/nanochat/): Documentation, setup instructions, and quickstart guides.
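
For orientation, a configuration for this stack would tie the registered pieces together along these lines. The section and key names are illustrative only; see configs/Nanochat/ for the actual schema:

# Hypothetical TOML sketch, not the exact contents of synthetic_micro.toml
[data]
datasource = "Nanochat"

[trainer]
type = "nanochat"
rounds = 1

[parameters.model]
type = "nanochat"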

How has this been tested?

Tested the CORE benchmark evaluation with the configuration file synthetic_micro.toml.

Command:

uv run --extra nanochat python plato.py --config configs/Nanochat/synthetic_micro.toml

Output showing a successful CORE benchmark evaluation on all 22 tasks after one round of federated training:

[INFO][23:22:05]: [Server #27388] Started model testing.
[INFO][23:23:10]: CORE task hellaswag_zeroshot | accuracy 0.2500 | centered 0.0000 | 64.92s
[INFO][23:23:38]: CORE task jeopardy | accuracy 0.0000 | centered 0.0000 | 27.83s
[INFO][23:24:03]: CORE task bigbench_qa_wikidata | accuracy 0.0000 | centered 0.0000 | 25.45s
[INFO][23:25:09]: CORE task arc_easy | accuracy 0.2500 | centered 0.0000 | 65.78s
[INFO][23:26:13]: CORE task arc_challenge | accuracy 0.1875 | centered -0.0833 | 63.81s
[INFO][23:26:32]: CORE task copa | accuracy 0.7500 | centered 0.5000 | 19.09s
[INFO][23:27:35]: CORE task commonsense_qa | accuracy 0.1250 | centered -0.0938 | 63.70s
[INFO][23:28:18]: CORE task piqa | accuracy 0.5000 | centered 0.0000 | 42.78s
[INFO][23:28:38]: CORE task openbook_qa | accuracy 0.5000 | centered 0.3333 | 20.04s
[INFO][23:28:59]: CORE task lambada_openai | accuracy 0.0000 | centered 0.0000 | 21.03s
[INFO][23:30:02]: CORE task hellaswag | accuracy 0.2500 | centered 0.0000 | 62.35s
[INFO][23:30:23]: CORE task winograd | accuracy 0.5000 | centered 0.0000 | 20.95s
[INFO][23:30:44]: CORE task winogrande | accuracy 0.6250 | centered 0.2500 | 21.36s
[INFO][23:31:12]: CORE task bigbench_dyck_languages | accuracy 0.0000 | centered 0.0000 | 27.69s
[INFO][23:32:16]: CORE task agi_eval_lsat_ar | accuracy 0.3750 | centered 0.2187 | 63.91s
[INFO][23:32:43]: CORE task bigbench_cs_algorithms | accuracy 0.0000 | centered 0.0000 | 27.70s
[INFO][23:33:11]: CORE task bigbench_operators | accuracy 0.0000 | centered 0.0000 | 28.04s
[INFO][23:33:39]: CORE task bigbench_repeat_copy_logic | accuracy 0.0000 | centered 0.0000 | 27.48s
[INFO][23:34:06]: CORE task squad | accuracy 0.0000 | centered 0.0000 | 27.23s
[INFO][23:34:34]: CORE task coqa | accuracy 0.0000 | centered 0.0000 | 27.86s
[INFO][23:35:18]: CORE task boolq | accuracy 0.4375 | centered -0.4803 | 43.64s
[INFO][23:36:25]: CORE task bigbench_language_identification | accuracy 0.0625 | centered -0.0314 | 67.52s
[INFO][23:36:25]: [Server #27388] Average Centered CORE benchmark metric: 2.79%
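
As a sanity check, the reported 2.79% is the plain arithmetic mean of the 22 per-task centered scores above:

# Recomputing the average from the per-task centered scores in the log
centered = [0.0, 0.0, 0.0, 0.0, -0.0833, 0.5, -0.0938, 0.0, 0.3333, 0.0,
            0.0, 0.0, 0.25, 0.0, 0.2187, 0.0, 0.0, 0.0, 0.0, 0.0,
            -0.4803, -0.0314]
print(f"{100 * sum(centered) / len(centered):.2f}%")  # -> 2.79%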

Types of changes

  • Bug fix (non-breaking change which fixes an issue) Fixes #
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code has been formatted using the Ruff formatter (ruff format) and checked using the Ruff linter (ruff check --fix).
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

baochunli and others added 16 commits (October 28, 2025). Notable fixes include:

  • Resolved a RuntimeError caused by non-contiguous tensors during view operations in nanochat's gpt.py ("view size is not compatible with input tensor's size and stride..."): replaced .view() with .reshape() (see the minimal reproduction after this list).
  • Resolved an issue where the configuration requested 'train_loss' in the results, but the server's get_logged_items() did not include it.
  • Avoided a vocabulary size mismatch between the model and the tokenizer during CORE evaluation.
  • Updated the log message from "global accuracy" to "Average Centered CORE benchmark metric".
  • Formatted the code using Ruff.
  • Added instructions for initializing submodules and resolving a maturin build failure.
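
The first fix above is easy to reproduce in isolation: .view() requires a contiguous tensor, while .reshape() falls back to a copy when needed:

# Minimal repro of the .view()/.reshape() behavior fixed in gpt.py
import torch

x = torch.randn(4, 6).transpose(0, 1)  # transposing makes x non-contiguous
print(x.reshape(-1).shape)             # works: reshape copies if it must
try:
    x.view(-1)                         # raises the RuntimeError quoted above
except RuntimeError as err:
    print(err)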